Use this template to complete your project throughout the course. Your Final Project presentation will be based on the contents of this document. Replace the title/name above and text below with your own, but keep the headers. Feel free to change the theme and other display settings, but this is not required.
#Give a brief a description of your project and its goal(s), what data you are using to complete it, and what two faculty/staff in different fields you have spoken to about your project with a brief summary of what you learned from each person. Include a link to your final project GitHub repository. Social determinants of health, such as inadequate housing, lack of access to reliable transportation, and insufficient food access, have increasingly been recognized as being factors that adversely impact health outcomes (https://health.gov/healthypeople/priority-areas/social-determinants-health). In this project, with guidance from advisors John Holmes, PhD and Sameh Saleh, MD, I will attempt to determine if there is an observed association between low food access and type 2 diabetes prevalence in the United States for the most recent year for which data about both were available, 2015. It is possible that there is a positive association between decreased food access and type 2 diabetes prevalence, as individuals in low food access areas may have little choice but to rely on convenience foods and fast foods for their nutritional intake, which is likely to increase the risk of developing type 2 diabetes. I will use the AHRQ Social Determinants of Health dataset for this work.
Link to final project Github repository: https://github.com/mcgreevj/BMIN503_Final_Project
#Describe the problem addressed, its significance, and some background to motivate the problem.
#Explain why your problem is interdisciplinary, what fields can contribute to its understanding, and incorporate background related to what you learned from meeting with faculty/staff. Type 2 diabetes is a major health problem in the United States, with significant morbidity and mortality for the at least 33 million people who suffer with this condition. Additionally, the economic costs of type 2 diabetes in the aggregate are substantial, totaling $327 billion in 2017 (The Cost of Diabetes | ADA). Tragically, type 2 diabetes is largely a preventable disease that has modifiable risk factors (weight, diet, exercise). Increasingly, there has been recognition that social determinants of health can serve as barriers that prevent individuals from being able to successfully modify risk factors. For example, individuals who lack access to insurance and/or reliable transportation may not be able to seek routine medical care for education, prevention, monitoring, and treatment purposes. As noted, diet is a major factor in the development of type 2 diabetes. And while ideal diets to prevent type 2 diabetes have been advised, not every American necessarily has access to food to be able to consume these recommended diets. This is the phenomenon of food insecurity. Food insecurity is formally defined by the US Department of Agriculture as: “the limited or uncertain availability of nutritionally adequate and safe foods, or limited or uncertain ability to acquire acceptable foods in socially acceptable ways.” (USDA ERS - Measurement). One factor that can lead to lack of access to healthy foods is the phenomenon of food deserts – areas that are low income and that have few or no supermarkets or other sources of healthy food options. As formally defined by the USDA, a food desert is: “A census tract that meets both low-income and low-access criteria including: 1. poverty rate is greater than or equal to 20 percent OR median family income does not exceed 80 percent statewide (rural/urban) or metro-area (urban) median family income; 2. at least 500 people or 33 percent of the population located more than 1 mile (urban) or 10 miles (rural) from the nearest supermarket or large grocery store.” In the absence of options for healthy food, individuals residing in these communities may need to resort to convenience and fast foods for some or many of their meals. In doing so, they increase the risk of developing type 2 diabetes.
This project is interdisciplinary in that it combines census tract, geographical data with health survey data. Notably the AHRQ SDOH database integrates data from multiple surveys and sources. The fields and topics that contribute to understanding of this issue are multiple: public health, clinical medicine, sociology, health disparities, economics, geography, history (in that the reasons many communities have limited access to food involves historical structural decisions, including forms of structural racism such as redlining). Moreover, the findings of this project have social justice, ethical, cultural, economic, and political implications. Based on collaboration with my mentors, I have learned the importance of familiarizing myself with data sources and data definitions, notably the Data Source Documentation guide from AHRQ (AHRQ Social Determinants of Health (SDOH) Database). This step is necessary from an exploratory standpoint to contemplate what kind of project I might do and also from a practical standpoint to understand how to properly and specifically answer my project question with the available data. I have also learned that some questions may be challenging to answer exactly as initially envisioned. For example, when considering this project at first, I had hoped to show changes over time in the association between limited food access and type 2 diabetes prevalence, perhaps including recent data from 2020 or 2021. However, after close inspection of this SDOH data set, I have realized that not all data is available for every year of interest. Thus I have needed to narrow the scope of my project a bit, given that data for limited food access and diabetes prevalence intersect in 2015, which is the most recent year that both types of data are available.
The identification of an independent association between low food access and type 2 diabetes prevalence would help support the case for policies and programs (both at the national and state levels) to improve food access in impacted census tracts. From a social justice standpoint, those with power in society (elected office holders, business and community leaders, members of the health care community including researchers) have a responsibility to seek ways to provide a known and effective preventive therapy (whether a vaccine or a grocery store) to those without such access currently, to prevent disease. There is also a national public health and cost dimension to this topic. People with diabetes suffer complications and morbidity related to diabetes, impairing their ability to support themselves, live independent lives, and contribute to a community’s economic activity and well-being. At the same time, the national cost of providing care for individuals with type 2 diabetes is borne by all Americans, in part because individuals living in low access food areas are more likely to have public (Medicaid, Medicare) insurance, if they have insurance at all, that is funded by all U.S. taxpayers. Indeed, Medicaid and Medicare combined provide health coverage to approximately 36% of Americans today and that percentage continues to rise over time. (Health Insurance Coverage of the Total Population | KFF). Data Note: Three Findings about Access to Care and Health Outcomes in Medicaid (kff.org). “Uncompensated” care for individuals with type 2 diabetes who lack any type of insurance is also a cost borne by individual health care organizations, governments, and ultimately, tax payers. Greater awareness of the relationships between SDOH and health outcomes, combined with a blunt accounting of the costs of inaction on this subject, will hopefully spur actions that can help mitigate SDOH and improve both individual and community health.
For this project, I used the AHRQ Social Determinants of Health database (available at https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html). I reviewed the data source documentation document (https://www.ahrq.gov/sites/default/files/wysiwyg/sdoh/SDOH-Data-Sources-Documentation-v1-Final.pdf) to learn about potential variables for this project and their availability. Ultimately, I planned to use two main variables available in the SDOH database: CHR_PCT_DM and FARA_TRACT_LILA_HALF_10.
CHR_PCT_DM is defined as the “percentage of adults with diagnosed diabetes (ages 20 and over).” Data for this variable is provided by the CDC Interactive Diabetes Atlas which in turn uses BRFSS data to estimate prevalence by U.S. county. Notably, starting in 2015, BRFSS included both landline and cellphone survey data to obtain diabetes prevalence data. Previously, this data had been collected exclusively via landline surveys.
FARA_PCT_LA_HALF_10_POP is defined as the “percentage of [healthy food] low access population measured at 1/2 mile for urban areas and 10 miles for rural areas.” For this variable, low access to healthy food is defined as being far from a supermarket, supercenter, or large grocery store, so more than 1/2 mile in urban areas and more than 10 miles in rural areas for this variable. #Does low access first require low-income designation for a tract, or is low access independent of low-income status for a tract? That is, a tract could be LA and not be LI?
This variable comes from the Food Access Research Atlas (FARA), created by the Economic Research Service (ERS) at the United States Department of Agriculture. A mapping tool is offered by the ERS and is available here: https://www.ers.usda.gov/data-products/food-access-research-atlas/go-to-the-atlas/.
In contrast to the diabetes prevalence data, FARA data is provided at the census tract level.
Of note, the following is the ERS methodology for calculating distance to the nearest grocery store, excerpted from the ERS definitions document available at https://www.ers.usda.gov/data-products/food-access-research-atlas/documentation/:
“First, the entire country is divided into ½-kilometer (about 1/3-mile)-square grids, and data on the population are aerially allocated to these grids (see Low-Income and Low-Foodstore-Access Census Tracts, 2015-19 in the link below). Distance to the nearest supermarket is measured for each grid cell by calculating the distance between the geographic center of the ½-kilometer square grid that contains estimates of the population (number of people and other subgroup characteristics) and the center of the grid with the nearest supermarket.”
While I had hoped to be able to include recent data for this project, the most recent year that data for both CHR_PCT_DM and FARA_PCT_LA_HALF_10_POP is available is 2015.
After having consulted with my advisors (Dr. Holmes and Dr. Saleh) and choosing the key variables above, I loaded the appropriate datasets into R.
For CHR_PCT_DIABETES, the 2015 data (by county) was available as an Excel file here: https://view.officeapps.live.com/op/view.aspx?src=https%3A%2F%2Fwww.ahrq.gov%2Fsites%2Fdefault%2Ffiles%2Fwysiwyg%2Fsdoh%2FSDOH_2015_COUNTY_1_0.xlsx&wdOrigin=BROWSELINK.
For FARA_PCT_LA_HALF_10_POP, the 2015 data (by tract) was available as an Excel file here: https://www.ahrq.gov/downloads/sdoh/sdoh_2015_tract_1_0.xlsx.
Once loaded into R, I manipulated each of these raw datasets so as to limit them to the variables of interest.
For the tract dataset, I mutated it to create a new variable, MEAN_PCT_LA_COUNTY, that reported the mean low access percentage by county, using group by and summarize functions.
I joined the dataframes containing CHR_PCT_DIABETES and FARA_PCT_LA_HALF_10_POP, respectively, by COUNTYFIPS, which was present in both dataframes.
In preparation for an exploratory data analysis using choropleths, I loaded census county files using practicum code as a guide.
I created a basic U.S. choropleth of low access percentage by county. Similarly, I created abasic U.S. choropleth of diabetes prevalence by county.
I will plan to create a logistic regression model to determine if there is any relationship between low food access and diabetes prevalence. [Is there any way to graphically demonstrate a relationship]
I will consider adjusting the model for other potentially relevant variables such as income, age, obesity if possible.
Result of exploratory data analysis Result of logistic regression model Result after adjusting for other variables [preliminary code below]
Describe your results and include relevant tables, plots, and code/comments used to obtain them. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.
Limitations There was not an evident way to capture purely type 2 diabetes data from available surveys. Interestingly, the CHR_PCT_DIABETES variable counts any diagnosis of diabetes, whether type 1, type 2, or other. In this project, the diabetes of interest is type 2 diabetes given that type 2 diabetes is associated with modifiable lifestyle factors. Still, given that 90-95% of all U.S. diabetes cases are type 2 (https://www.cdc.gov/diabetes/basics/type2.html#:~:text=Healthy%20eating%20is%20your%20recipe,adults%20are%20also%20developing%20it), this project is still capable of providing a reasonably strong estimate of any association between low access and type 2 diabetes prevalence.
Conclusion
[Describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why.]
library("readxl")#Read in the 2015 low access data, reported by census tract
Lowaccess_raw<- read_excel("C:\\Users\\McGreevJ\\Desktop\\BMI 5030 Data Science\\BMIN503_Final_Project\\Adjusted datasets\\data.sdoh_2015_tract_1_0.xlsx")
dim(Lowaccess_raw) #74133 x 405## [1] 74133 405
#Read in the 2015 diabetes prevalence data, reported by county
Diabetes_raw<- read_excel("C:\\Users\\McGreevJ\\Desktop\\BMI 5030 Data Science\\BMIN503_Final_Project\\Adjusted datasets\\data.SDOH_2015_COUNTY_1_0.xlsx")## Warning: Expecting logical in RJ1671 / R1671C478: got '46123'
## Warning: Expecting logical in RJ1763 / R1763C478: got '32510'
## Warning: Expecting logical in RK1763 / R1763C479: got '41025'
## Warning: Expecting logical in RL1763 / R1763C480: got '41037'
## Warning: Expecting logical in RJ2797 / R2797C478: got '49017'
## Warning: Expecting logical in RK2797 / R2797C479: got '49019'
## Warning: Expecting logical in RL2797 / R2797C480: got '49025'
## Warning: Expecting logical in RM2797 / R2797C481: got '49055'
## Warning: Expecting logical in RJ2842 / R2842C478: got '51760'
dim(Diabetes_raw) #3234 x 979## [1] 3234 979
library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(cowplot)
library(dplyr)
#Checking for countyfips discrepancies between raw datasets
setdiff(Lowaccess_raw$COUNTYFIPS, Diabetes_raw$COUNTYFIPS)## [1] "69085"
#COUNTYFIPS 69085 (Northern Mariana Islands) is the only difference. 69085 is represented in Lowaccess_raw but not in Diabetes_raw.
#Define "Lowaccess" as a subset of "Lowaccess_raw" to include TRACTFIPS, COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, FARA_TRACT_LILA_HALF_10, FARA_PCT_LA_HALF_10_POP
Lowaccess<- dplyr::select(Lowaccess_raw, TRACTFIPS, COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, FARA_TRACT_LILA_HALF_10, FARA_PCT_LA_HALF_10_POP)
dim(Lowaccess) #74133 x 9## [1] 74133 9
#Define "Diabetes" as a subset of "Diabetes_raw" to include COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, ACS_MEDIAN_AGE, ACS_PCT_EMPLOYED, ACS_MEDIAN_HH_INC, ACS_PCT_HH_INC_10000, ACS_PCT_HH_INC_100000, ACS_PER_CAPITA_INC, ACS_PCT_POV_AIAN, ACS_MEDIAN_HOME_VALUE, ACS_PCT_MEDICAID_ANY, ACS_PCT_MEDICAID_ANY_BELOW64, ACS_PCT_MEDICARE_ONLY, ACS_PCT_PRIVATE_ANY, ACS_PCT_UNINSURED, CEN_AIAN_NH_IND, ACS_TOT_HH, ACS_AVG_HH_SIZE, ACS_PCT_CHILD_1FAM, CHR_PCT_ADULT_OBESITY, CHR_PCT_DIABETES
Diabetes<- dplyr::select(Diabetes_raw, COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, ACS_MEDIAN_AGE, ACS_PCT_EMPLOYED, ACS_MEDIAN_HH_INC, ACS_PCT_HH_INC_10000, ACS_PCT_HH_INC_100000, ACS_PER_CAPITA_INC, ACS_PCT_POV_AIAN, ACS_MEDIAN_HOME_VALUE, ACS_PCT_MEDICAID_ANY, ACS_PCT_MEDICAID_ANY_BELOW64, ACS_PCT_MEDICARE_ONLY, ACS_PCT_PRIVATE_ANY, ACS_PCT_UNINSURED, CEN_AIAN_NH_IND, ACS_TOT_HH, ACS_AVG_HH_SIZE, ACS_PCT_CHILD_1FAM, CHR_PCT_ADULT_OBESITY, CHR_PCT_DIABETES)
dim(Diabetes) #3234 x 25## [1] 3234 25
#Rename county population column in Diabetes "ACS_TOT_POP_WT" to "County_Pop"
Diabetes<- dplyr::rename(Diabetes, "County_Pop" = ACS_TOT_POP_WT)
head(Diabetes)## # A tibble: 6 × 25
## COUNTYFIPS STATE…¹ STATE COUNTY REGION Count…² ACS_M…³ ACS_P…⁴ ACS_M…⁵ ACS_P…⁶
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 01 Alab… Autau… South 55221 37.7 92.4 51281 6.14
## 2 01003 01 Alab… Baldw… South 195121 42.2 92.5 50254 6.36
## 3 01005 01 Alab… Barbo… South 26932 38.8 82.4 32964 13.5
## 4 01007 01 Alab… Bibb … South 22604 38.9 91.7 38678 7.74
## 5 01009 01 Alab… Bloun… South 57710 40.7 92.3 45813 7.77
## 6 01011 01 Alab… Bullo… South 10678 39.3 82.0 31938 18.9
## # … with 15 more variables: ACS_PCT_HH_INC_100000 <dbl>,
## # ACS_PER_CAPITA_INC <dbl>, ACS_PCT_POV_AIAN <dbl>,
## # ACS_MEDIAN_HOME_VALUE <dbl>, ACS_PCT_MEDICAID_ANY <dbl>,
## # ACS_PCT_MEDICAID_ANY_BELOW64 <dbl>, ACS_PCT_MEDICARE_ONLY <dbl>,
## # ACS_PCT_PRIVATE_ANY <dbl>, ACS_PCT_UNINSURED <dbl>, CEN_AIAN_NH_IND <chr>,
## # ACS_TOT_HH <dbl>, ACS_AVG_HH_SIZE <dbl>, ACS_PCT_CHILD_1FAM <dbl>,
## # CHR_PCT_ADULT_OBESITY <dbl>, CHR_PCT_DIABETES <dbl>, and abbreviated …
dim(Diabetes) #3234 x 25## [1] 3234 25
#Innerjoin Diabetes and Lowaccess
test_innerjoin<- inner_join(Diabetes, Lowaccess, by = "COUNTYFIPS")
head(test_innerjoin)## # A tibble: 6 × 33
## COUNTYFIPS STATEFIPS.x STATE.x COUNT…¹ REGIO…² Count…³ ACS_M…⁴ ACS_P…⁵ ACS_M…⁶
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 2 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 3 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 4 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 5 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 6 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## # … with 24 more variables: ACS_PCT_HH_INC_10000 <dbl>,
## # ACS_PCT_HH_INC_100000 <dbl>, ACS_PER_CAPITA_INC <dbl>,
## # ACS_PCT_POV_AIAN <dbl>, ACS_MEDIAN_HOME_VALUE <dbl>,
## # ACS_PCT_MEDICAID_ANY <dbl>, ACS_PCT_MEDICAID_ANY_BELOW64 <dbl>,
## # ACS_PCT_MEDICARE_ONLY <dbl>, ACS_PCT_PRIVATE_ANY <dbl>,
## # ACS_PCT_UNINSURED <dbl>, CEN_AIAN_NH_IND <chr>, ACS_TOT_HH <dbl>,
## # ACS_AVG_HH_SIZE <dbl>, ACS_PCT_CHILD_1FAM <dbl>, …
dim(test_innerjoin) #74132 x 33## [1] 74132 33
#Inspect to see if Northern Mariana (69085) is present in test_innerjoin - should not be
View(test_innerjoin)
#Search for 69085 in "View(test_innerjoin)" table yields no results, so Northern Mariana is excluded, as desired#Create a new column to represent weighted population, "census_tract_pop_wt" which is tract population divided by county population.
test_innerjoin<- dplyr::mutate(test_innerjoin, "census_tract_pop_wt" = (ACS_TOT_POP_WT/County_Pop))
head(test_innerjoin)## # A tibble: 6 × 34
## COUNTYFIPS STATEFIPS.x STATE.x COUNT…¹ REGIO…² Count…³ ACS_M…⁴ ACS_P…⁵ ACS_M…⁶
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 2 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 3 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 4 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 5 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## 6 01001 01 Alabama Autaug… South 55221 37.7 92.4 51281
## # … with 25 more variables: ACS_PCT_HH_INC_10000 <dbl>,
## # ACS_PCT_HH_INC_100000 <dbl>, ACS_PER_CAPITA_INC <dbl>,
## # ACS_PCT_POV_AIAN <dbl>, ACS_MEDIAN_HOME_VALUE <dbl>,
## # ACS_PCT_MEDICAID_ANY <dbl>, ACS_PCT_MEDICAID_ANY_BELOW64 <dbl>,
## # ACS_PCT_MEDICARE_ONLY <dbl>, ACS_PCT_PRIVATE_ANY <dbl>,
## # ACS_PCT_UNINSURED <dbl>, CEN_AIAN_NH_IND <chr>, ACS_TOT_HH <dbl>,
## # ACS_AVG_HH_SIZE <dbl>, ACS_PCT_CHILD_1FAM <dbl>, …
dim(test_innerjoin) #74132 x 34## [1] 74132 34
#How many NAs are in the FARA_PCT_LA_HALF_10_POP column?
sum(is.na(test_innerjoin$FARA_PCT_LA_HALF_10_POP))## [1] 1625
#Remove #1625 FARA na's from test_innerjoin
test_innerjoin<- test_innerjoin %>% drop_na(FARA_PCT_LA_HALF_10_POP)
dim(test_innerjoin) #72507 x 34 (so all 1625 NA FARA_PCT_LA_HALF_10_POP values removed)## [1] 72507 34
View(test_innerjoin) #Visual inspection of full data confirms there are no NA values remaining in the FARA_PCT_LA_HALF_10_POP data column #Defining a new dataframe based on test_innerjoin that contains a newly-defined weighted mean column, where results are grouped by COUNTYFIPS (one row per county) and where other variables of interest for the analysis are included
colnames(test_innerjoin)## [1] "COUNTYFIPS" "STATEFIPS.x"
## [3] "STATE.x" "COUNTY.x"
## [5] "REGION.x" "County_Pop"
## [7] "ACS_MEDIAN_AGE" "ACS_PCT_EMPLOYED"
## [9] "ACS_MEDIAN_HH_INC" "ACS_PCT_HH_INC_10000"
## [11] "ACS_PCT_HH_INC_100000" "ACS_PER_CAPITA_INC"
## [13] "ACS_PCT_POV_AIAN" "ACS_MEDIAN_HOME_VALUE"
## [15] "ACS_PCT_MEDICAID_ANY" "ACS_PCT_MEDICAID_ANY_BELOW64"
## [17] "ACS_PCT_MEDICARE_ONLY" "ACS_PCT_PRIVATE_ANY"
## [19] "ACS_PCT_UNINSURED" "CEN_AIAN_NH_IND"
## [21] "ACS_TOT_HH" "ACS_AVG_HH_SIZE"
## [23] "ACS_PCT_CHILD_1FAM" "CHR_PCT_ADULT_OBESITY"
## [25] "CHR_PCT_DIABETES" "TRACTFIPS"
## [27] "STATEFIPS.y" "STATE.y"
## [29] "COUNTY.y" "REGION.y"
## [31] "ACS_TOT_POP_WT" "FARA_TRACT_LILA_HALF_10"
## [33] "FARA_PCT_LA_HALF_10_POP" "census_tract_pop_wt"
test_innerjoin_wtmean<- test_innerjoin %>% dplyr::group_by(COUNTYFIPS, COUNTY.x, STATE.x, REGION.x, County_Pop, ACS_MEDIAN_HH_INC, ACS_PER_CAPITA_INC, ACS_PCT_POV_AIAN, ACS_MEDIAN_HOME_VALUE, ACS_PCT_UNINSURED, CHR_PCT_ADULT_OBESITY, CHR_PCT_DIABETES) %>% dplyr::summarise(wm_FARA = weighted.mean(FARA_PCT_LA_HALF_10_POP, census_tract_pop_wt)) %>% as.data.frame()## `summarise()` has grouped output by 'COUNTYFIPS', 'COUNTY.x', 'STATE.x',
## 'REGION.x', 'County_Pop', 'ACS_MEDIAN_HH_INC', 'ACS_PER_CAPITA_INC',
## 'ACS_PCT_POV_AIAN', 'ACS_MEDIAN_HOME_VALUE', 'ACS_PCT_UNINSURED',
## 'CHR_PCT_ADULT_OBESITY'. You can override using the `.groups` argument.
head(test_innerjoin_wtmean)## COUNTYFIPS COUNTY.x STATE.x REGION.x County_Pop ACS_MEDIAN_HH_INC
## 1 01001 Autauga County Alabama South 55221 51281
## 2 01003 Baldwin County Alabama South 195121 50254
## 3 01005 Barbour County Alabama South 26932 32964
## 4 01007 Bibb County Alabama South 22604 38678
## 5 01009 Blount County Alabama South 57710 45813
## 6 01011 Bullock County Alabama South 10678 31938
## ACS_PER_CAPITA_INC ACS_PCT_POV_AIAN ACS_MEDIAN_HOME_VALUE ACS_PCT_UNINSURED
## 1 24974 52.20 141300 10.12
## 2 27317 9.15 169300 12.96
## 3 16824 0.00 92200 15.51
## 4 18431 4.65 102700 9.66
## 5 20532 6.60 119800 11.64
## 6 17580 0.00 68600 18.09
## CHR_PCT_ADULT_OBESITY CHR_PCT_DIABETES wm_FARA
## 1 37.5 14.2 54.81617
## 2 31.0 11.3 40.21763
## 3 44.3 18.0 31.32199
## 4 37.8 14.9 1.18436
## 5 34.4 14.3 13.44140
## 6 39.4 20.0 70.79069
dim(test_innerjoin_wtmean) #3140 x 13## [1] 3140 13
#Noting that Diabetes had 3234 rows and that test_innerjoin_wtmean has only 3140 rows
setdiff(Diabetes$COUNTYFIPS, test_innerjoin_wtmean$COUNTYFIPS)## [1] "02158" "46102" "60000" "60010" "60020" "60030" "60040" "60050" "66010"
## [10] "69000" "69100" "69110" "69120" "72001" "72003" "72005" "72007" "72009"
## [19] "72011" "72013" "72015" "72017" "72019" "72021" "72023" "72025" "72027"
## [28] "72029" "72031" "72033" "72035" "72037" "72039" "72041" "72043" "72045"
## [37] "72047" "72049" "72051" "72053" "72054" "72055" "72057" "72059" "72061"
## [46] "72063" "72065" "72067" "72069" "72071" "72073" "72075" "72077" "72079"
## [55] "72081" "72083" "72085" "72087" "72089" "72091" "72093" "72095" "72097"
## [64] "72099" "72101" "72103" "72105" "72107" "72109" "72111" "72113" "72115"
## [73] "72117" "72119" "72121" "72123" "72125" "72127" "72129" "72131" "72133"
## [82] "72135" "72137" "72139" "72141" "72143" "72145" "72147" "72149" "72151"
## [91] "72153" "78010" "78020" "78030"
#There are 94 counties in Diabetes that are not in test_innerjoin_wtmean
View(Diabetes)
#These counties appear to be in Alaska and U.S. territories. Examples: 02158 (Kusilvak Census Area, AK), 60010 (Am Samoa), 72151 (Puerto Rico), 78030 (US Virgin Islands). Excluding these counties from test_innerjoin_wtmean is acceptable given that we do not have the ability to plot choropleths (see counties rds file below) with these non-continental U.S. counties.#Loading libraries, county geography data in prep for exploratory data analysis, including choropleths:
library(rgdal)## Warning: package 'rgdal' was built under R version 4.2.2
## Loading required package: sp
## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-2, (SVN revision 1183)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.5.2, released 2022/09/02
## Path to GDAL shared files: C:/Users/McGreevJ/AppData/Local/Programs/R/R-4.2.1/library/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
## Path to PROJ shared files: C:/Users/McGreevJ/AppData/Local/Programs/R/R-4.2.1/library/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.5-1
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
library(leaflet)## Warning: package 'leaflet' was built under R version 4.2.2
library(ggplot2)
library(RColorBrewer)
library(plyr)## Warning: package 'plyr' was built under R version 4.2.2
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(rgdal)
library(sf)## Warning: package 'sf' was built under R version 4.2.2
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidycensus)## Warning: package 'tidycensus' was built under R version 4.2.2
library(ggsn)## Warning: package 'ggsn' was built under R version 4.2.2
## Loading required package: grid
library(leaflet)
library(ggplot2)
library(RColorBrewer)
counties<- readRDS(gzcon(url("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/uscounties_2010.rds")))
st_crs(counties) <- st_crs(counties)## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
#using base plot function to check that counties contains polygon data:
plot(counties)#using ggplot to check that counties contains polygon data:
ggplot() +
geom_sf(data = counties)#Joining test_innerjoin and counties but first need to create a common variable by which to join. Will convert State to a 2 digit number in Counties, then mutate State and County into a 5 digit number, a mutated variable in Counties called "COUNTYFIPS" to correspond to "COUNTYFIPS" in the combined data set, test_innerjoin.
counties$STATE<- str_pad(counties$STATE, width=2, side="left", pad="0")
head(counties)## Simple feature collection with 6 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -88.13999 ymin: 31.19518 xmax: -85.12342 ymax: 34.8922
## Geodetic CRS: WGS 84
## GEO_ID STATE COUNTY NAME LSAD CENSUSAREA
## 1 0500000US01001 01 001 Autauga County 594.436
## 2 0500000US01009 01 009 Blount County 644.776
## 3 0500000US01017 01 017 Chambers County 596.531
## 4 0500000US01021 01 021 Chilton County 692.854
## 5 0500000US01033 01 033 Colbert County 592.619
## 6 0500000US01045 01 045 Dale County 561.150
## geometry
## 1 MULTIPOLYGON (((-86.49677 3...
## 2 MULTIPOLYGON (((-86.5778 33...
## 3 MULTIPOLYGON (((-85.18413 3...
## 4 MULTIPOLYGON (((-86.51734 3...
## 5 MULTIPOLYGON (((-88.13999 3...
## 6 MULTIPOLYGON (((-85.41644 3...
counties<- counties %>% dplyr::mutate(COUNTYFIPS = paste(STATE, COUNTY, sep = ""))
head(counties)## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -88.13999 ymin: 31.19518 xmax: -85.12342 ymax: 34.8922
## Geodetic CRS: WGS 84
## GEO_ID STATE COUNTY NAME LSAD CENSUSAREA
## 1 0500000US01001 01 001 Autauga County 594.436
## 2 0500000US01009 01 009 Blount County 644.776
## 3 0500000US01017 01 017 Chambers County 596.531
## 4 0500000US01021 01 021 Chilton County 692.854
## 5 0500000US01033 01 033 Colbert County 592.619
## 6 0500000US01045 01 045 Dale County 561.150
## geometry COUNTYFIPS
## 1 MULTIPOLYGON (((-86.49677 3... 01001
## 2 MULTIPOLYGON (((-86.5778 33... 01009
## 3 MULTIPOLYGON (((-85.18413 3... 01017
## 4 MULTIPOLYGON (((-86.51734 3... 01021
## 5 MULTIPOLYGON (((-88.13999 3... 01033
## 6 MULTIPOLYGON (((-85.41644 3... 01045
#Assessing number of rows in counties and test_innerjoin_wtmean
counties %>% summarise(n_distinct(COUNTYFIPS)) #3109## n_distinct(COUNTYFIPS)
## 1 3109
test_innerjoin_wtmean %>% summarise(n_distinct(COUNTYFIPS)) #3140## n_distinct(COUNTYFIPS)
## 1 3140
#So 31 rows appear to be different between counties and test_innerjoin_wtmean
setdiff(counties$COUNTYFIPS, test_innerjoin_wtmean$COUNTYFIPS)## [1] "46113" "51515"
#Yields 2 countyfips, 46113 and 51515 that are different, both present in counties but not in test_innerjoin_wtmean
#These are:
#Shannon County, SD (FIPS code = 46113) was renamed Oglala Lakota County and assigned anew FIPS code (46102) effective in 2014.
#Bedford City, VA (FIPS code=51515). In 2013, Bedford City, an independent city, merged with Bedford county (FIPS code=51019). Bedford City has a category code in all three of the NCHS schemes. Bedford City category codes are the same as those for Bedford County.
#But what about the other rows that are different? What are those rows?
#Using the alternate sequence for setdiff, with test_innerjoin_wt mean first:
setdiff(test_innerjoin_wtmean$COUNTYFIPS, counties$COUNTYFIPS)## [1] "02013" "02016" "02020" "02050" "02060" "02068" "02070" "02090" "02100"
## [10] "02105" "02110" "02122" "02130" "02150" "02164" "02170" "02180" "02185"
## [19] "02188" "02195" "02198" "02220" "02230" "02240" "02261" "02275" "02282"
## [28] "02290" "15001" "15003" "15005" "15007" "15009"
#Yields 33 differences, pasted here, that are present in test_innerjoin_wtmean but not present in counties:
#1 "02013" "02016" "02020" "02050" "02060" "02068"
#7 "02070" "02090" "02100" "02105" "02110" "02122"
#13 "02130" "02150" "02164" "02170" "02180" "02185"
#19 "02188" "02195" "02198" "02220" "02230" "02240"
#25 "02261" "02275" "02282" "02290" "15001" "15003"
#31 "15005" "15007" "15009"
#These differences are for Alaska (State code 02) and Hawaii (State code 15). These states are not represented in counties.
#An innerjoin of counties and test_innerjoin_wtmean would likely omit the Hawaii and Alaska data that is in test_innerjoin_wtmean.test_counties_inner<- inner_join(counties, test_innerjoin_wtmean, by = 'COUNTYFIPS')
str(test_counties_inner) #3107 obs of 20 variables, noting that Hawaii and Alaska data, as well as Shannon County, SD and Bedford City, VA data, have been removed## Classes 'sf' and 'data.frame': 3107 obs. of 20 variables:
## $ GEO_ID : Factor w/ 3221 levels "0500000US01001",..: 1 5 9 11 17 23 26 33 40 42 ...
## $ STATE : chr "01" "01" "01" "01" ...
## $ COUNTY : Factor w/ 325 levels "001","003","005",..: 1 6 13 16 23 30 34 43 53 55 ...
## $ NAME : Factor w/ 1909 levels "Abbeville","Acadia",..: 89 174 309 341 387 460 567 730 986 1009 ...
## $ LSAD : Factor w/ 8 levels "Borough","CA",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ CENSUSAREA : num 594 645 597 693 593 ...
## $ COUNTYFIPS : chr "01001" "01009" "01017" "01021" ...
## $ COUNTY.x : chr "Autauga County" "Blount County" "Chambers County" "Chilton County" ...
## $ STATE.x : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ REGION.x : chr "South" "South" "South" "South" ...
## $ County_Pop : num 55221 57710 34079 43819 54444 ...
## $ ACS_MEDIAN_HH_INC : num 51281 45813 34177 41627 40576 ...
## $ ACS_PER_CAPITA_INC : num 24974 20532 21071 21399 22546 ...
## $ ACS_PCT_POV_AIAN : num 52.2 6.6 36.84 6.33 18.24 ...
## $ ACS_MEDIAN_HOME_VALUE: num 141300 119800 80800 100100 99400 ...
## $ ACS_PCT_UNINSURED : num 10.1 11.6 13.9 16 11.1 ...
## $ CHR_PCT_ADULT_OBESITY: num 37.5 34.4 40.3 35.3 32.5 37.3 36.1 42 35 33.6 ...
## $ CHR_PCT_DIABETES : num 14.2 14.3 17 14.7 17.6 15.6 14.6 16.1 14.9 12.4 ...
## $ wm_FARA : num 54.8 13.4 30.7 13.9 52.8 ...
## $ geometry :sfc_MULTIPOLYGON of length 3107; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:10, 1:2] -86.5 -86.7 -86.8 -86.9 -86.9 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr [1:19] "GEO_ID" "STATE" "COUNTY" "NAME" ...
View(test_counties_inner)#Exploratory data analysis - sorting data, top/bottom rates and corresponding counties
#Starting with low access...
colnames(test_counties_inner)## [1] "GEO_ID" "STATE" "COUNTY"
## [4] "NAME" "LSAD" "CENSUSAREA"
## [7] "COUNTYFIPS" "COUNTY.x" "STATE.x"
## [10] "REGION.x" "County_Pop" "ACS_MEDIAN_HH_INC"
## [13] "ACS_PER_CAPITA_INC" "ACS_PCT_POV_AIAN" "ACS_MEDIAN_HOME_VALUE"
## [16] "ACS_PCT_UNINSURED" "CHR_PCT_ADULT_OBESITY" "CHR_PCT_DIABETES"
## [19] "wm_FARA" "geometry"
#Inspecting a trimmed-down dataframe
selection<- test_counties_inner %>% dplyr::select(COUNTY.x, STATE.x, wm_FARA)
View(selection)#RESUME HERE###
#Mean low access percentage
mean_LA<- test_counties_inner %>% dplyr::summarise(Mean_LA_County = mean(wm_FARA, na.rm = TRUE))
mean_LA #Mean county low access was 38.16%## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## Mean_LA_County geometry
## 1 38.1614 MULTIPOLYGON (((-85.56778 4...
#Median low access percentage
median_LA<- test_counties_inner %>% dplyr::summarise(Median_LA_County = median(wm_FARA, na.rm = TRUE))
median_LA #Median county low access was 36.44%## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## Median_LA_County geometry
## 1 36.44304 MULTIPOLYGON (((-85.56778 4...
#How many counties have 100% of population with low access?
y<-test_counties_inner$wm_FARA
counttoplowaccess<-length(which(y == 100))
counttoplowaccess #55 counties have 100% of the population with low access to food## [1] 55
#Lowest access counties identified
Lowest_access_counties<- test_counties_inner %>% dplyr::select(COUNTYFIPS, REGION.x, STATE.x, COUNTY.x, wm_FARA) %>% slice_max(wm_FARA)
Lowest_access_counties## Simple feature collection with 55 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -120.4952 ymin: 29.6235 xmax: -79.31228 ymax: 48.99902
## Geodetic CRS: WGS 84
## First 10 features:
## COUNTYFIPS REGION.x STATE.x COUNTY.x wm_FARA
## 1 20075 Midwest Kansas Hamilton County 100
## 2 20203 Midwest Kansas Wichita County 100
## 3 30037 West Montana Golden Valley County 100
## 4 31183 Midwest Nebraska Wheeler County 100
## 5 30059 West Montana Meagher County 100
## 6 30069 West Montana Petroleum County 100
## 7 30107 West Montana Wheatland County 100
## 8 31009 Midwest Nebraska Blaine County 100
## 9 31085 Midwest Nebraska Hayes County 100
## 10 31117 Midwest Nebraska McPherson County 100
## geometry
## 1 MULTIPOLYGON (((-101.5675 3...
## 2 MULTIPOLYGON (((-101.5671 3...
## 3 MULTIPOLYGON (((-108.7801 4...
## 4 MULTIPOLYGON (((-98.29576 4...
## 5 MULTIPOLYGON (((-110.2819 4...
## 6 MULTIPOLYGON (((-108.6309 4...
## 7 MULTIPOLYGON (((-109.6543 4...
## 8 MULTIPOLYGON (((-99.68696 4...
## 9 MULTIPOLYGON (((-100.7774 4...
## 10 MULTIPOLYGON (((-101.2697 4...
View(Lowest_access_counties)#In what counties do more than 75% of the population have low food access?
yseventyfive<- test_counties_inner$wm_FARA
countseventyfive<- length(which(yseventyfive > 75))
countseventyfive #In 165 counties greater than 75% of the population has low food access## [1] 165
Topquartile<- test_counties_inner %>% dplyr::select(COUNTYFIPS, REGION.x, STATE.x, COUNTY.x, wm_FARA) %>% filter(wm_FARA > 75)
Topquartile## Simple feature collection with 165 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -123.7279 ymin: 26.06684 xmax: -71.2248 ymax: 49.00027
## Geodetic CRS: WGS 84
## First 10 features:
## COUNTYFIPS REGION.x STATE.x COUNTY.x wm_FARA
## 1 08014 West Colorado Broomfield County 79.02029
## 2 12111 South Florida St. Lucie County 78.43846
## 3 13059 South Georgia Clarke County 76.37750
## 4 08111 West Colorado San Juan County 99.24000
## 5 20049 Midwest Kansas Elk County 99.36000
## 6 20075 Midwest Kansas Hamilton County 100.00000
## 7 20203 Midwest Kansas Wichita County 100.00000
## 8 30037 West Montana Golden Valley County 100.00000
## 9 31163 Midwest Nebraska Sherman County 83.43000
## 10 31171 Midwest Nebraska Thomas County 99.94000
## geometry
## 1 MULTIPOLYGON (((-105.1473 3...
## 2 MULTIPOLYGON (((-80.67786 2...
## 3 MULTIPOLYGON (((-83.36003 3...
## 4 MULTIPOLYGON (((-107.7383 3...
## 5 MULTIPOLYGON (((-96.52569 3...
## 6 MULTIPOLYGON (((-101.5675 3...
## 7 MULTIPOLYGON (((-101.5671 3...
## 8 MULTIPOLYGON (((-108.7801 4...
## 9 MULTIPOLYGON (((-98.74853 4...
## 10 MULTIPOLYGON (((-100.8461 4...
View(Topquartile)#What regions have the lowest food access?
Topregions<- Topquartile %>% dplyr::summarise(count(REGION.x))
Topregions## Simple feature collection with 4 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -123.7279 ymin: 26.06684 xmax: -71.2248 ymax: 49.00027
## Geodetic CRS: WGS 84
## x freq geometry
## 1 Midwest 67 MULTIPOLYGON (((-101.2277 4...
## 2 Northeast 1 MULTIPOLYGON (((-101.2277 4...
## 3 South 55 MULTIPOLYGON (((-101.2277 4...
## 4 West 42 MULTIPOLYGON (((-101.2277 4...
Topregions_summary<- dplyr::arrange(Topregions, desc(freq))
Topregions_summary## Simple feature collection with 4 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -123.7279 ymin: 26.06684 xmax: -71.2248 ymax: 49.00027
## Geodetic CRS: WGS 84
## x freq geometry
## 1 Midwest 67 MULTIPOLYGON (((-101.2277 4...
## 2 South 55 MULTIPOLYGON (((-101.2277 4...
## 3 West 42 MULTIPOLYGON (((-101.2277 4...
## 4 Northeast 1 MULTIPOLYGON (((-101.2277 4...
#Ranked by number of counties with >75% of population with low food access: Midwest with 67 counties; South with 55 counties; West with 42 counties, Northeast with 1 county, for 165 counties total #How many counties have 0% of population with low access (that is, they are high food access)?
y1<-test_counties_inner$wm_FARA
countleastlowaccess<-length(which(y1 == 0))
countleastlowaccess #38 counties have 0% of the population with low access to food## [1] 38
#Highest access counties (0% of population is low access)
Highest_access_counties<- test_counties_inner %>% dplyr::select(COUNTYFIPS, REGION.x, STATE.x, COUNTY.x, wm_FARA) %>% slice_min(wm_FARA)
Highest_access_counties## Simple feature collection with 38 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -105.6903 ymin: 31.43369 xmax: -76.24528 ymax: 45.21074
## Geodetic CRS: WGS 84
## First 10 features:
## COUNTYFIPS REGION.x STATE.x COUNTY.x wm_FARA
## 1 08047 West Colorado Gilpin County 0
## 2 17035 Midwest Illinois Cumberland County 0
## 3 18115 Midwest Indiana Ohio County 0
## 4 21077 South Kentucky Gallatin County 0
## 5 37103 South North Carolina Jones County 0
## 6 47057 South Tennessee Grainger County 0
## 7 48379 South Texas Rains County 0
## 8 48425 South Texas Somervell County 0
## 9 13109 South Georgia Evans County 0
## 10 13119 South Georgia Franklin County 0
## geometry
## 1 MULTIPOLYGON (((-105.3978 3...
## 2 MULTIPOLYGON (((-88.01212 3...
## 3 MULTIPOLYGON (((-84.84945 3...
## 4 MULTIPOLYGON (((-84.66011 3...
## 5 MULTIPOLYGON (((-77.47372 3...
## 6 MULTIPOLYGON (((-83.66746 3...
## 7 MULTIPOLYGON (((-95.66539 3...
## 8 MULTIPOLYGON (((-97.86486 3...
## 9 MULTIPOLYGON (((-81.96907 3...
## 10 MULTIPOLYGON (((-83.35527 3...
View(Highest_access_counties)#Double-checking to make sure there are no counties that are missing low access values
y2<-test_counties_inner$wm_FARA
countmissinglowaccess<-length(which(is.na(y2)))
countmissinglowaccess #0 counties are missing values for low access## [1] 0
#Continuing exploratory data analysis with diabetes...
#Inspecting a trimmed-down dataframe
selection_dm<- test_counties_inner %>% dplyr::select(COUNTY.x, STATE.x, CHR_PCT_DIABETES)
View(selection_dm)#Mean percentage of population with diabetes
mean_DM<- test_counties_inner %>% dplyr::summarise(Mean_DM_County = mean(CHR_PCT_DIABETES, na.rm = TRUE))
mean_DM #Across counties, mean percentage of population with diabetes was 11.66%## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## Mean_DM_County geometry
## 1 11.6589 MULTIPOLYGON (((-85.56778 4...
#Median percentage of population with diabetes
median_DM<- test_counties_inner %>% dplyr::summarise(Median_DM_County = median(CHR_PCT_DIABETES, na.rm = TRUE))
median_DM #Across counties, median perentage of population with diabetes was 11.5%## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## Median_DM_County geometry
## 1 11.5 MULTIPOLYGON (((-85.56778 4...
#What were the top 100 U.S. counties in terms of greatest percentage of population with diabetes in 2015?
Diabetes_high<- test_counties_inner %>% dplyr::select(COUNTYFIPS, REGION.x, STATE.x, COUNTY.x, CHR_PCT_DIABETES) %>% slice_max(CHR_PCT_DIABETES, n = 100)
View(Diabetes_high) #Note that search for top 100 actually returns 109 entries given a number of counties are tied in their population percentages with diabetes#What regions had the greatest number of counties in the top 100 in terms of percentage of population with diabetes in 2015?
Topregions_DM<- Diabetes_high %>% dplyr::summarize(count(REGION.x) %>% dplyr::arrange(desc(freq)))
View(Topregions_DM)
#South and Midwest. #105 of the top 109 counties in terms of highest percentages of diabetes are in the South. 4 are in the Midwest.#What are the 100 counties with the lowest percentages of their populations with diabetes?
Diabetes_low<- test_counties_inner %>% dplyr::select(COUNTYFIPS, REGION.x, STATE.x, COUNTY.x, CHR_PCT_DIABETES) %>% slice_min(CHR_PCT_DIABETES, n = 100)
View(Diabetes_low) #Note that search for lowest 100 counties actually returns 111 entries given a number of counties are tied in their population percentages with diabetes#What regions are represented by the bottom 100 counties, those counties that have the lowest percentage of population with diabetes?
Bottomregions_DM<- Diabetes_low %>% dplyr::summarize(count(REGION.x) %>% dplyr::arrange(desc(freq)))
View(Bottomregions_DM)
#West (69 counties), Midwest (25 counties), Northeast (9 counties), South (8 counties)#Double-checking to verify that no counties are missing values for diabetes prevalence
y_dm<-test_counties_inner$CHR_PCT_DIABETES
countmissingdm<-length(which(is.na(y_dm)))
countmissingdm #0 counties are missing values for diabetes prevalence## [1] 0
#Exploratory data analysis - getting set up for choropleths and leaflets
library(RColorBrewer)
my_theme <- function() {
theme_minimal() +
theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_line(color = "white"),
legend.key.size = unit(0.5, "cm"),
legend.text = element_text(size = 10),
legend.title = element_text(size = 12),
plot.title = element_text(size = 16))
}
myPalette <- colorRampPalette(brewer.pal(9, "YlOrRd"))
#colors
library(leaflet)
# Select a color palette with which to run the palette function
pal_fun <- colorNumeric("BuPu", NULL) # Blue-Purple from RColorBrewer
pal_fun2 <- colorNumeric("YlOrRd", NULL) # Yellow-Orange-Red from RColorBrewer
pal_fun3 <- colorNumeric("viridis", NULL) # viridis from viridis
pal_fun4 <- colorNumeric("inferno", NULL) # inferno from viridis
pal_fun5 <- colorNumeric("inferno", NULL, reverse=TRUE) # reverses the color ramp#Choropleths (welcome any suggestions for making them prettier, less clunky, noting scalebar overlaying the choropleth, county colors look a bid faded...)
#Low access choropleth, and saving same as png
choro_lowaccess<- ggplot() +
geom_sf(data = test_counties_inner, aes(fill = wm_FARA), lwd = 0) +
my_theme() +
ggtitle("Percent of population with low food access, by U.S. county, 2015") +
scale_fill_continuous(low = "antiquewhite2", high = "red", guide = "colorbar") +
north(x.min = -75.28031, y.min = 39.86747, # add north bar with north()
x.max = -74.95575, y.max = 40.13793, # set map boundaries with x.min, y.min, etc..
symbol=12, # select north arrow symbol
anchor = c(x = -75, y = 39.93)) + # set north bar location
scalebar(x.min = -75.28031, y.min = 39.86747, # add scalebar with scalebar()
x.max = -74.95575, y.max = 40.13793, # set map boundaries
dist = 5, dist_unit = 'km', # set scalebar segment length = 5km
transform = TRUE, # TRUE for decimal degree coordinates
model = "WGS84") # specify CR
choro_lowaccess#Diabetes choropleth, and saving same as png
choro_diabetes<- ggplot() +
geom_sf(data = test_counties_inner, aes(fill = CHR_PCT_DIABETES), lwd = 0) +
my_theme() +
ggtitle("Percent of population with diabetes, by U.S. county, 2015") +
scale_fill_continuous(low = "antiquewhite2", high = "red", guide = "colorbar") +
north(x.min = -75.28031, y.min = 39.86747, # add north bar with north()
x.max = -74.95575, y.max = 40.13793, # set map boundaries with x.min, y.min, etc..
symbol=12, # select north arrow symbol
anchor = c(x = -75, y = 39.93)) + # set north bar location
scalebar(x.min = -75.28031, y.min = 39.86747, # add scalebar with scalebar()
x.max = -74.95575, y.max = 40.13793, # set map boundaries
dist = 5, dist_unit = 'km', # set scalebar segment length = 5km
transform = TRUE, # TRUE for decimal degree coordinates
model = "WGS84") # specify CR
choro_diabetes#Leaflet for low access
#Popup message - low access
pu_message1 <- paste0(test_counties_inner$COUNTY.x, ", ", test_counties_inner$STATE.x, "<br>Low food access percentage 2015: ", round(test_counties_inner$wm_FARA, 1), "%")
#Interactive map with legend and scalebar, and saving image
leaflet_lowaccess<-leaflet(test_counties_inner) %>%
addPolygons(fillColor = ~pal_fun3(wm_FARA),
popup = pu_message1) %>%
addTiles() %>% addLegend("bottomright", pal=pal_fun3, values = ~wm_FARA,
title = '2015 Low food access percentages by U.S. county',
opacity = 1) %>% addScaleBar()
leaflet_lowaccess#Leaflet for diabetes prevalence
#Popup message - diabetes
pu_message2 <- paste0(test_counties_inner$COUNTY.x, ", ", test_counties_inner$STATE.x, "<br>Diabetes prevalence 2015: ", round(test_counties_inner$CHR_PCT_DIABETES, 1), "%")
#Interactive map with legend and scalebar
leaflet_diabetes<- leaflet(test_counties_inner) %>%
addPolygons(fillColor = ~pal_fun3(CHR_PCT_DIABETES),
popup = pu_message2) %>%
addTiles() %>% addLegend("bottomright", pal=pal_fun3, values = ~CHR_PCT_DIABETES,
title = '2015 Diabetes prevalence rates by U.S. county',
opacity = 1) %>% addScaleBar()
leaflet_diabetes#Scatterplot, correlation between low access percentage and diabetes percentage
#Which, if any, of the following are worth keeping? Suggestions welcome.
library(modelsummary)##
## Attaching package: 'modelsummary'
## The following object is masked from 'package:tidycensus':
##
## get_estimates
#Scatterplot
ggplot(test_counties_inner, aes(x = wm_FARA, y = CHR_PCT_DIABETES)) +
geom_point(color = "darkseagreen3") + ggtitle("Scatterplot: Low food access vs. Diabetes prevalence")#Correlation
#Pearson
cor(test_counties_inner$wm_FARA, test_counties_inner$CHR_PCT_DIABETES, use = "complete.obs") ## [1] -0.3060672
#Spearman
cor(test_counties_inner$wm_FARA, test_counties_inner$CHR_PCT_DIABETES, use = "complete.obs", method = "spearman") #Rank-based correlation## [1] -0.3258509
#Create a logistic model
#Which variable sequence is correct?
LA_DM_lm_fit <- lm(CHR_PCT_DIABETES ~ wm_FARA, data = test_counties_inner)
LA_DM_lm_fit##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ wm_FARA, data = test_counties_inner)
##
## Coefficients:
## (Intercept) wm_FARA
## 12.93230 -0.03337
#Summarize logistic model
summary_LA_DM_lm_fit <- summary.lm(LA_DM_lm_fit) #Summary of results
summary_LA_DM_lm_fit##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ wm_FARA, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7474 -1.7094 -0.0926 1.5925 9.4299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.932301 0.083690 154.53 <2e-16 ***
## wm_FARA -0.033369 0.001863 -17.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.462 on 3105 degrees of freedom
## Multiple R-squared: 0.09368, Adjusted R-squared: 0.09339
## F-statistic: 320.9 on 1 and 3105 DF, p-value: < 2.2e-16
#Retrieving output from summary statistics for model fit
names(summary_LA_DM_lm_fit) ## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
summary_LA_DM_lm_fit$adj.r.squared## [1] 0.09338522
#Retrieving specific elements from the model statistics
coef(LA_DM_lm_fit) #We recover the intercept and coefficient for x that are expected## (Intercept) wm_FARA
## 12.93230118 -0.03336884
confint(LA_DM_lm_fit) #Confidence intervals for fit## 2.5 % 97.5 %
## (Intercept) 12.76820735 13.09639502
## wm_FARA -0.03702103 -0.02971666
ggplot(test_counties_inner, aes(x = wm_FARA, y = CHR_PCT_DIABETES)) +
geom_point(color = "darkseagreen3") +
geom_smooth(method = "lm", color = "green4")## `geom_smooth()` using formula 'y ~ x'
#Model summary provides a nicer table of output
modelsummary(LA_DM_lm_fit, statistic = "p.value") | Model 1 | |
|---|---|
| (Intercept) | 12.932 |
| (0.000) | |
| wm_FARA | −0.033 |
| (0.000) | |
| Num.Obs. | 3107 |
| R2 | 0.094 |
| R2 Adj. | 0.093 |
| AIC | 14420.8 |
| BIC | 14438.9 |
| Log.Lik. | −7207.376 |
| RMSE | 2.46 |
summary(LA_DM_lm_fit)##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ wm_FARA, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7474 -1.7094 -0.0926 1.5925 9.4299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.932301 0.083690 154.53 <2e-16 ***
## wm_FARA -0.033369 0.001863 -17.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.462 on 3105 degrees of freedom
## Multiple R-squared: 0.09368, Adjusted R-squared: 0.09339
## F-statistic: 320.9 on 1 and 3105 DF, p-value: < 2.2e-16
#How best to interpret these findings? As food access becomes more limited in a county, prevalence of diabetes decreases?#Assessing other variables that may have a relationship to diabetes prevalence, beyond low access to food...
#Median household income
ggplot(test_counties_inner, aes(x = ACS_MEDIAN_HH_INC, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") + ggtitle("Median household income relationship to diabetes prevalence")## Warning: Removed 1 rows containing missing values (geom_point).
summary((lm(CHR_PCT_DIABETES ~ ACS_MEDIAN_HH_INC, data = test_counties_inner)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_MEDIAN_HH_INC, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4296 -1.3736 -0.0069 1.3735 7.4564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.729e+01 1.524e-01 113.42 <2e-16 ***
## ACS_MEDIAN_HH_INC -1.207e-04 3.163e-06 -38.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.134 on 3104 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3193, Adjusted R-squared: 0.3191
## F-statistic: 1456 on 1 and 3104 DF, p-value: < 2.2e-16
#Per capita income
ggplot(test_counties_inner, aes(x = ACS_PER_CAPITA_INC, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") + ggtitle("Per capita income relationship to diabetes prevalence")summary((lm(CHR_PCT_DIABETES ~ ACS_PER_CAPITA_INC, data = test_counties_inner)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_PER_CAPITA_INC, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8249 -1.4033 0.0386 1.3774 8.5143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.752e+01 1.693e-01 103.47 <2e-16 ***
## ACS_PER_CAPITA_INC -2.413e-04 6.786e-06 -35.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.18 on 3105 degrees of freedom
## Multiple R-squared: 0.2895, Adjusted R-squared: 0.2892
## F-statistic: 1265 on 1 and 3105 DF, p-value: < 2.2e-16
#Obesity
ggplot(test_counties_inner, aes(x = CHR_PCT_ADULT_OBESITY, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") +
ggtitle("Obesity rate relationship to diabetes prevalence") summary((lm(CHR_PCT_DIABETES ~ CHR_PCT_ADULT_OBESITY, data = test_counties_inner)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ CHR_PCT_ADULT_OBESITY, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1775 -1.2811 -0.0812 1.2155 7.1327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.507953 0.243063 -2.09 0.0367 *
## CHR_PCT_ADULT_OBESITY 0.379292 0.007501 50.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.915 on 3105 degrees of freedom
## Multiple R-squared: 0.4516, Adjusted R-squared: 0.4514
## F-statistic: 2557 on 1 and 3105 DF, p-value: < 2.2e-16
#Uninsured status
ggplot(test_counties_inner, aes(x = ACS_PCT_UNINSURED, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") + ggtitle("Uninsured status relationship to diabetes prevalence")summary((lm(CHR_PCT_DIABETES ~ ACS_PCT_UNINSURED, data = test_counties_inner)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_PCT_UNINSURED, data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9911 -1.7029 -0.0521 1.6202 8.8322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.077338 0.122469 82.28 <2e-16 ***
## ACS_PCT_UNINSURED 0.118761 0.008552 13.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.51 on 3105 degrees of freedom
## Multiple R-squared: 0.05847, Adjusted R-squared: 0.05817
## F-statistic: 192.8 on 1 and 3105 DF, p-value: < 2.2e-16
#Combined linear models
#Including median income
lm_medincome <- lm(CHR_PCT_DIABETES ~ wm_FARA + ACS_MEDIAN_HH_INC, data = test_counties_inner)
summary(lm_medincome)##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ wm_FARA + ACS_MEDIAN_HH_INC,
## data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8978 -1.3902 -0.0317 1.3951 7.3582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.763e+01 1.513e-01 116.52 <2e-16 ***
## wm_FARA -2.016e-02 1.622e-03 -12.43 <2e-16 ***
## ACS_MEDIAN_HH_INC -1.115e-04 3.174e-06 -35.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.083 on 3103 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3516, Adjusted R-squared: 0.3512
## F-statistic: 841.2 on 2 and 3103 DF, p-value: < 2.2e-16
modelsummary(lm_medincome)| Model 1 | |
|---|---|
| (Intercept) | 17.632 |
| (0.151) | |
| wm_FARA | −0.020 |
| (0.002) | |
| ACS_MEDIAN_HH_INC | 0.000 |
| (0.000) | |
| Num.Obs. | 3106 |
| R2 | 0.352 |
| R2 Adj. | 0.351 |
| AIC | 13378.4 |
| BIC | 13402.6 |
| Log.Lik. | −6685.208 |
| RMSE | 2.08 |
#Including per capita income
lm_percapincome <- lm(CHR_PCT_DIABETES ~ wm_FARA + ACS_PER_CAPITA_INC, data = test_counties_inner)
summary(lm_percapincome)##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ wm_FARA + ACS_PER_CAPITA_INC,
## data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1861 -1.4174 0.0288 1.3942 7.5393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.783e+01 1.670e-01 106.76 <2e-16 ***
## wm_FARA -2.085e-02 1.654e-03 -12.61 <2e-16 ***
## ACS_PER_CAPITA_INC -2.214e-04 6.806e-06 -32.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.127 on 3104 degrees of freedom
## Multiple R-squared: 0.3241, Adjusted R-squared: 0.3236
## F-statistic: 744 on 2 and 3104 DF, p-value: < 2.2e-16
#Including obesity
lm_obesity <- lm(CHR_PCT_DIABETES ~ wm_FARA + CHR_PCT_ADULT_OBESITY, data = test_counties_inner)
summary(lm_obesity)##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ wm_FARA + CHR_PCT_ADULT_OBESITY,
## data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8289 -1.2325 -0.0362 1.1769 7.4004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.902135 0.255513 3.531 0.000421 ***
## wm_FARA -0.020350 0.001429 -14.237 < 2e-16 ***
## CHR_PCT_ADULT_OBESITY 0.359543 0.007400 48.587 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.856 on 3104 degrees of freedom
## Multiple R-squared: 0.4852, Adjusted R-squared: 0.4849
## F-statistic: 1463 on 2 and 3104 DF, p-value: < 2.2e-16
#Including uninsured status
lm_uninsured <- lm(CHR_PCT_DIABETES ~ wm_FARA + ACS_PCT_UNINSURED, data = test_counties_inner)
summary(lm_uninsured)##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ wm_FARA + ACS_PCT_UNINSURED,
## data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.355 -1.623 -0.042 1.519 8.860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.372705 0.136556 83.28 <2e-16 ***
## wm_FARA -0.032800 0.001806 -18.16 <2e-16 ***
## ACS_PCT_UNINSURED 0.115481 0.008134 14.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.386 on 3104 degrees of freedom
## Multiple R-squared: 0.1489, Adjusted R-squared: 0.1484
## F-statistic: 271.6 on 2 and 3104 DF, p-value: < 2.2e-16
#How about a model that accounts for all of the above variables?
#Trying...
lm_combinedvars<- lm(wm_FARA ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC + ACS_PER_CAPITA_INC + CHR_PCT_ADULT_OBESITY + ACS_PCT_UNINSURED, data = test_counties_inner)
#Which of the following is the best output?
modelsummary(lm_combinedvars, statistic = "p.value")| Model 1 | |
|---|---|
| (Intercept) | 28.942 |
| (0.000) | |
| CHR_PCT_DIABETES | −2.802 |
| (0.000) | |
| ACS_MEDIAN_HH_INC | 0.000 |
| (0.873) | |
| ACS_PER_CAPITA_INC | 0.001 |
| (0.000) | |
| CHR_PCT_ADULT_OBESITY | 0.504 |
| (0.000) | |
| ACS_PCT_UNINSURED | 0.584 |
| (0.000) | |
| Num.Obs. | 3106 |
| R2 | 0.114 |
| R2 Adj. | 0.113 |
| AIC | 28114.9 |
| BIC | 28157.2 |
| Log.Lik. | −14050.459 |
| RMSE | 22.30 |
modelsummary(lm_combinedvars)| Model 1 | |
|---|---|
| (Intercept) | 28.942 |
| (5.551) | |
| CHR_PCT_DIABETES | −2.802 |
| (0.231) | |
| ACS_MEDIAN_HH_INC | 0.000 |
| (0.000) | |
| ACS_PER_CAPITA_INC | 0.001 |
| (0.000) | |
| CHR_PCT_ADULT_OBESITY | 0.504 |
| (0.126) | |
| ACS_PCT_UNINSURED | 0.584 |
| (0.090) | |
| Num.Obs. | 3106 |
| R2 | 0.114 |
| R2 Adj. | 0.113 |
| AIC | 28114.9 |
| BIC | 28157.2 |
| Log.Lik. | −14050.459 |
| RMSE | 22.30 |
lm_combinedvars##
## Call:
## lm(formula = wm_FARA ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC +
## ACS_PER_CAPITA_INC + CHR_PCT_ADULT_OBESITY + ACS_PCT_UNINSURED,
## data = test_counties_inner)
##
## Coefficients:
## (Intercept) CHR_PCT_DIABETES ACS_MEDIAN_HH_INC
## 28.9421388 -2.8015710 0.0000114
## ACS_PER_CAPITA_INC CHR_PCT_ADULT_OBESITY ACS_PCT_UNINSURED
## 0.0007160 0.5043941 0.5838058
summary(lm_combinedvars) #I tend to prefer summary##
## Call:
## lm(formula = wm_FARA ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC +
## ACS_PER_CAPITA_INC + CHR_PCT_ADULT_OBESITY + ACS_PCT_UNINSURED,
## data = test_counties_inner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.831 -16.746 -1.076 14.540 72.647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.894e+01 5.551e+00 5.214 1.97e-07 ***
## CHR_PCT_DIABETES -2.802e+00 2.309e-01 -12.133 < 2e-16 ***
## ACS_MEDIAN_HH_INC 1.140e-05 7.121e-05 0.160 0.873
## ACS_PER_CAPITA_INC 7.160e-04 1.558e-04 4.594 4.51e-06 ***
## CHR_PCT_ADULT_OBESITY 5.044e-01 1.260e-01 4.003 6.41e-05 ***
## ACS_PCT_UNINSURED 5.838e-01 8.985e-02 6.498 9.46e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.32 on 3100 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.114, Adjusted R-squared: 0.1125
## F-statistic: 79.74 on 5 and 3100 DF, p-value: < 2.2e-16
#How to properly interpret these findings? #How to state conclusions about relationship between Low access and Diabetes prevalence, and these other variables appropriately?#END#